ABSTRACT

There is an increasing use of imperceivable and redundant
local features for face recognition. Since only a relatively
small fraction of them is relevant to the final recognition
task, feature selection is a crucial and necessary step to
select the most discriminant ones and obtain a compact face
representation. In this paper, we investigate sparsity-enforcing
regularization-based feature selection methods and propose a
multi-task feature selection method for building person
specific models for face verification. We assume that the
person specific models share a common subset of features and
reformulate the common subset selection problem as a
simultaneous sparse approximation problem. The effectiveness
of the proposed methods is verified on the challenging LFW
face database.

Index Terms — Person specific face verification, feature 
selection, multi-task learning, simultaneous sparse approxi- 
mation 

1. INTRODUCTION 

Although face recognition has achieved significant progress
under controlled conditions in the past decades, it is still a
very challenging problem in uncontrolled environments such as
the web, where pose, lighting, expression, age, occlusion and
makeup variations are more complicated. As local areas are
often more descriptive and more appropriate for dealing with
those variations, there is an increasing use of imperceivable
local features for face verification. Those local descriptors
are generally extracted by performing some transformation
(either linear or nonlinear) on the local region only, or
followed by some explicit spatial pooling such as the spatial
histogramming scheme [1]. This initial representation is often
redundant or overcomplete, whereas only a relatively small
fraction of the features is relevant to the recognition task.
Thus feature selection is a crucial and necessary step to
select the most discriminant local features and obtain a
compact face representation, which can not only improve
performance but also decrease the computational burden.



The Adaboost-based method is the most popular and impressive
feature selection method in face recognition scenarios
EIGJIUEI. It applies the simple weak classifier, which only 
consists in a threshold on the value of a single feature, many 
times on differently weighted versions of the data, thereby
obtaining a sequence of weak classifiers corresponding to the
selected features. One possible problem of these methods is
that they are very time consuming, because a different
classifier must be trained and evaluated for each feature. An
alternative is the family of sparsity-enforcing regularization
techniques [6], which are the state-of-the-art feature selection
tools in bioinformatics and have recently been successfully
applied in face detection and verification [7]. The main
advantages of the regularization approach are its effectiveness
even in high-dimensional, small sample size cases and the
support of well-grounded theory [6].

The concern of this work is mainly about how to build per- 
son specific models for both feature selection and face verifi- 
cation in unconstrained environments. In this case, although 
the face verification can be seen as a binary classification 
problem (accept or reject), it is in fact several binary clas- 
sification problems (one for each client model) and thus its 
essence is by nature a multiple binary classification problem. 
Most existing approaches train a generic model for all individ- 
uals 0|51 3], which may fail to capture the variations among 
different individuals and are therefore suboptimal. Other
approaches build person specific models for different
individuals separately [7], which often leads to overfitting due
to the small sample size of each individual. To combat the
overfitting problem, Wang et al. [5] adopted multi-task learning
to improve the generalization performance of the Adaboost-based
methods.

In this paper, we investigate the multi-task generalization
of regularized methods and propose a multi-task feature
selection method for person specific face verification. We
assume that different person specific models share a common
subset of relevant features and reformulate the common subset
selection problem as a simultaneous sparse approximation
problem. The classification can then be done by simple linear
regression such as ridge regression. The experimental results
on the LFW face database [8] demonstrate the advantages and
effectiveness of the proposed methods.

2. NOTATION AND SETUP 

Suppose that there are $L$ individuals to be verified. We are
given a training image set of size $N$, in which $N_l$ images
correspond to subject $l$, while the remaining images are of
other subjects outside the $L$ known subjects. From each image
we obtain a $d$-dimensional feature vector. Let
$X = [x_1, \dots, x_d] \in \mathbb{R}^{N \times d}$ be the data
matrix with each row an input feature vector, and
$Y = [y_1, \dots, y_L] \in \mathbb{R}^{N \times L}$ be the
corresponding indicator response matrix, where $y_l$ is an
$N$-dimensional vector whose $i$-th entry equals 1 if the $i$-th
sample comes from subject $l$ and 0 otherwise. Therefore $Y$ is
a matrix of 0's and 1's with each row having at most a single 1.
We write $c_l$ for the $l$-th column of a matrix $C$ and $c^l$
for its $l$-th row.
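
As an illustration of this setup, the indicator response matrix
can be assembled as in the following minimal sketch. The label
encoding (a subject index in 0, ..., L-1 for known subjects and
-1 for background images) and the function name
`indicator_matrix` are assumptions made for this example only.

```python
import numpy as np

def indicator_matrix(labels, num_known):
    """Build the N x L indicator response matrix Y.

    labels[i] is the subject index in {0, ..., L-1} for a known subject,
    or -1 for a background (unknown) subject. This encoding is an
    assumption of the sketch, not something fixed by the paper.
    """
    labels = np.asarray(labels)
    Y = np.zeros((labels.size, num_known))
    known = labels >= 0
    # Put a single 1 in the column of the subject each known image shows.
    Y[np.flatnonzero(known), labels[known]] = 1.0
    return Y

# Toy example: 6 images, 3 known subjects, 2 background images.
labels = [0, 0, 1, -1, 2, -1]
Y = indicator_matrix(labels, num_known=3)
```

Rows belonging to background images stay all-zero, so every row
has at most a single 1.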

3. SINGLE-TASK FEATURE SELECTION 

We restrict ourselves to regression models that are linear in
the components of the feature vector. For class $l$, this
linear relationship can be written in matrix notation as

$$ y_l = X c_l + b_l \mathbf{1}, \quad l = 1, \dots, L, \tag{1} $$

where $c_l$ is a $d$-dimensional coefficient vector, $b_l$ is
the bias in the model of class $l$, and $\mathbf{1}$ is a vector
with all entries equal to 1. The squared error loss function

$$ \mathrm{Err}(c_l, b_l) = \| y_l - X c_l - b_l \mathbf{1} \|_2^2 \tag{2} $$

is adopted to fit the above linear model to the given training
set. Minimizing the squared error loss directly yields the
least squares solution, which is typically non-sparse and thus
performs no feature selection. A natural generalization for
feature selection is $\ell_0$ regularization

$$ \min_{c_l, b_l} \mathrm{Err}(c_l, b_l) + \lambda \| c_l \|_0, \tag{3} $$

where $\| \cdot \|_0$ is the $\ell_0$ quasi-norm counting the
nonzero entries of a vector and $\lambda$ quantifies how much
improvement in the approximation error is necessary before we
admit an additional term into the approximation. This is a
classic combinatorial sparse approximation problem, which is
NP-hard in general. Many numerical methods have been proposed
to solve it, and the two most common approaches are greedy
methods and convex relaxation methods. Greedy techniques such
as orthogonal matching pursuit (OMP) abandon exhaustive search
and instead construct a sparse approximation one step at a
time, selecting the column that maximally reduces the residual
and using it to update the current approximation. Convex
relaxation methods replace the combinatorial sparse
approximation problem with a related convex version that can be
solved more efficiently. As the $\ell_1$ norm provides a natural
convex relaxation of the $\ell_0$ quasi-norm, the basis pursuit
(BP) method solves the sparse approximation problem by
introducing an $\ell_1$ norm in place of the $\ell_0$ quasi-norm

$$ \min_{c_l, b_l} \mathrm{Err}(c_l, b_l) + \lambda \| c_l \|_1, \tag{4} $$

which is an unconstrained convex problem and can be solved by
standard mathematical programming software. As before, the
parameter $\lambda$ negotiates a compromise between
approximation error and sparsity. It is also known as LASSO

m. 

Provided the regularization coefficient $\lambda$ is the same
across different individuals, solving each of these problems
independently is equivalent to solving the global problem
obtained by summing the objectives:

$$ \min_{C, b} \sum_{l=1}^{L} \frac{1}{N_l} \mathrm{Err}(c_l, b_l) + \lambda \sum_{l=1}^{L} \| c_l \|_0, \tag{5} $$

where $C$ is the coefficient matrix with $c_l$ in its columns
and $b = [b_1, \dots, b_L]^T$ is the bias vector. Similarly, the
corresponding $\ell_1$ norm relaxation objective is

$$ \min_{C, b} \sum_{l=1}^{L} \frac{1}{N_l} \mathrm{Err}(c_l, b_l) + \lambda \sum_{l=1}^{L} \| c_l \|_1. \tag{6} $$
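
The greedy strategy described in this section can be sketched
for a single task as follows. This is a minimal illustration
rather than the implementation used in the experiments: the
function name `omp`, the handling of the bias by centering the
data, and the normalization of the correlations by the column
norms are assumptions of the sketch.

```python
import numpy as np

def omp(X, y, k):
    """Greedy single-task selection of k columns of X (OMP-style sketch)."""
    # Centering X and y removes the bias term from the least squares fits.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    norms = np.linalg.norm(Xc, axis=0)
    norms[norms == 0] = 1.0            # guard against constant columns
    residual = yc.copy()
    chosen = np.zeros(X.shape[1], dtype=bool)
    selected = []
    for _ in range(k):
        # Pick the unused column most correlated with the current residual.
        corr = np.abs(Xc.T @ residual) / norms
        corr[chosen] = -np.inf
        j = int(np.argmax(corr))
        chosen[j] = True
        selected.append(j)
        # Refit on all selected columns and update the residual.
        coef, *_ = np.linalg.lstsq(Xc[:, selected], yc, rcond=None)
        residual = yc - Xc[:, selected] @ coef
    c = np.zeros(X.shape[1])
    c[selected] = coef
    b = y.mean() - X.mean(axis=0) @ c   # recover the bias of the linear model
    return c, b, selected

# Toy problem: only columns 3 and 7 are relevant.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = 2.0 * X[:, 3] - 1.5 * X[:, 7] + 0.5
c, b, selected = omp(X, y, k=2)
```

Here the number of selected columns `k` plays the role that
$\lambda$ plays in the penalized formulation: it fixes the
sparsity level directly instead of trading it off against the
approximation error.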

4. MULTI-TASK FEATURE SELECTION 

In this section, we describe the proposed multi-task feature
selection method in detail. As mentioned before, we assume
that all faces share a common subset of the redundant and
imperceivable local features. This is reasonable because all
faces share a common structure, i.e. a face is composed of
eyebrows, eyes, nose, mouth, etc. In the regularized feature
selection framework, sharing a small subset of features means
that the coefficient matrix $C$ has many rows that are
identically zero, so that the corresponding features are not
used by any task. Thus the global common feature selection can
be formulated as searching for the minimum number of nonzero
rows of $C$ while balancing the error loss function

$$ \min_{C, b} \sum_{l=1}^{L} \frac{1}{N_l} \mathrm{Err}(c_l, b_l) + \lambda \| C \|_{\text{row-}0}, \tag{7} $$

where $\| \cdot \|_{\text{row-}0}$ is the row-$\ell_0$
quasi-norm, which denotes the number of nonzero rows and is
given by

$$ \| C \|_{\text{row-}0} = \Big| \bigcup_{l=1}^{L} \mathrm{supp}(c_l) \Big|, \tag{8} $$

where $\mathrm{supp}(\cdot)$ denotes the support of a vector.
When the matrix $C$ is a column vector, the row-support
degenerates to the support of the vector and the row-$\ell_0$
quasi-norm degenerates to the usual $\ell_0$ quasi-norm. If we
regard $X$ as a dictionary and $y_l$ ($l = 1, \dots, L$) as a
series of signals to be approximated, problem (7) is indeed a
simultaneous sparse approximation problem.
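
The difference between counting nonzero rows and counting
nonzero entries column by column can be made concrete with a
small numerical sketch. The helper names `row_support` and
`row_l0` are hypothetical and introduced only for this example.

```python
import numpy as np

def row_support(C):
    """Indices of the rows of C that contain at least one nonzero entry."""
    return np.flatnonzero(np.any(C != 0, axis=1))

def row_l0(C):
    """Row-l0 quasi-norm: size of the union of the column supports."""
    return row_support(C).size

# 4 features (rows), 3 tasks (columns); only features 1 and 3 are used.
C = np.array([[0.0, 0.0,  0.0],
              [1.2, 0.0, -0.3],
              [0.0, 0.0,  0.0],
              [0.0, 0.7,  0.0]])

shared = row_l0(C)                  # number of features used by any task
per_task = np.count_nonzero(C)      # sum of the column-wise l0 quasi-norms
single = row_l0(C[:, [0]])          # one-column case: usual l0 quasi-norm
```

In this toy matrix the tasks use 3 nonzero coefficients in
total but only 2 distinct features, which is the quantity the
multi-task penalty counts.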

It is immediately clear that the combinatorial optimization
problem (7) is at least as hard as the combinatorial
optimization problem (3), and thus it is also NP-hard in
general. Greedy pursuit algorithms such as simultaneous
orthogonal matching pursuit (SOMP) [9] have been proposed to
solve this combinatorial optimization problem. Another approach
to simultaneous sparse approximation is to replace the
row-$\ell_0$ quasi-norm by a closely related convex function
[10]. There are many different ways to relax the row-$\ell_0$
quasi-norm, and one may define an entire family of relaxations
of the following form

$$ \| C \|_{p,q} = \sum_{i=1}^{d} \big( \| c^i \|_q^q \big)^{p/q} = \sum_{i=1}^{d} \Big[ \sum_{j=1}^{L} | c_{ij} |^q \Big]^{p/q}. \tag{9} $$

This relaxation first applies the $\ell_q$ norm to the rows of
$C$ and then applies the $\ell_p$ norm or quasi-norm to the
resulting vector of $\ell_q$ norms. On the one hand, we want
$C$ to be row-sparse. On the other hand, we want each selected
feature to contribute to as many individuals as possible. This
requires that most rows of $C$ be zero while the nonzero rows
have many nonzero entries. Therefore we take $p \le 1$ and
$q > 1$. The rationale behind this is that minimizing the
$\ell_p$ ($p \le 1$) norm promotes sparsity, whereas minimizing
the $\ell_q$ ($q > 1$) norm promotes non-sparsity. With $p = 1$
and $q = 2$, our method is equivalent to the multi-task feature
selection framework proposed in [11]. In our implementation, we
set $q = \infty$ since the $\ell_\infty$ norm can provide better
non-sparsity than the $\ell_2$ norm. Replacing the row-$\ell_0$
quasi-norm in the combinatorial optimization problem (7) by its
relaxation (9) with $p = 1$ leads to the following convex
program

$$ \min_{C, b} \sum_{l=1}^{L} \frac{1}{N_l} \mathrm{Err}(c_l, b_l) + \lambda \| C \|_{p,q}, \tag{10} $$

which can be solved by standard mathematical programming
software [10].
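
To illustrate why such a mixed norm favors shared features, the
following sketch evaluates the relaxation for two coefficient
matrices with the same number of nonzero entries. The function
name `mixed_norm` is hypothetical, and the two toy matrices are
chosen only for illustration.

```python
import numpy as np

def mixed_norm(C, p=1.0, q=np.inf):
    """Sum over the rows of C of the l_q norm of the row raised to the power p."""
    row_norms = np.linalg.norm(C, ord=q, axis=1)
    return float(np.sum(row_norms ** p))

# Both matrices have two nonzero coefficients of magnitude 1.
shared = np.array([[1.0, 1.0],      # one feature used by both tasks
                   [0.0, 0.0]])
separate = np.array([[1.0, 0.0],    # each task uses its own feature
                     [0.0, 1.0]])

penalty_shared = mixed_norm(shared, p=1.0, q=np.inf)      # 1.0
penalty_separate = mixed_norm(separate, p=1.0, q=np.inf)  # 2.0
plain_l1 = mixed_norm(shared, p=1.0, q=1.0)               # 2.0, no sharing effect
```

With $p = 1$ and $q = \infty$ the matrix that reuses one feature
across both tasks is penalized less than the one that spreads
the same coefficients over two features, whereas with
$p = q = 1$ the two matrices receive the same penalty.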

Recall that the above feature selection framework can be used
for classification directly, since it fits linear regression
models to the class indicator variables. One can also use it as
a pure feature selection tool and apply other common
classifiers for classification. Moreover, the proposed method
is not specific to face verification but applies to any other
classification or regression problem, provided that the tasks
share the same training data.

5. EXPERIMENTAL RESULTS 

We carry out experiments on the LFW face database [8]. The LFW
face database contains 13,233 labeled face images collected
from news sites on the Internet. These images belong to 5,749
different individuals and have high variations in position,
pose, lighting, background, camera and quality, which makes the
database appropriate for evaluating face verification methods
in realistic and unconstrained environments. As no protocol for
person specific face verification is provided with the
database, we select 158 people with at least 10 images in the
database as the known people, i.e. $L = 158$. For each known
person, we use the first 5 images for training and the
remaining images for testing. We also select 210 people with
only one image in the database as the background (or unknown)
persons for training. Hence we have a training set of size
1,000 corresponding to 368 people and a testing set of size
3,534 from the 158 known people. Note that our training set is
overwhelmingly unbalanced (5 positive samples and 995 negative
samples for each person, a ratio close to 1 : 200).

In our experiments, each image is rotated and scaled so that
the centers of the eyes are placed on specific pixels, and is
then cropped to 64 × 64 pixels. We choose the Gabor feature as
the initial representation due to its ability to model the
spatial summation properties of the receptive fields of the
so-called "bar cells" in the primary visual cortex. We use 40
Gabor filters with five scales {0, ..., 4} and eight
orientations {0, ..., 7}, which are common in the face
recognition area, to obtain the Gabor feature. The dimension
$d$ of the resulting feature is then 64 × 64 × 40 = 163,840.
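
The sizes quoted in this setup follow from simple counting,
which the short check below reproduces. All variable names are
introduced only for this illustration; the numbers are those
stated in the text.

```python
# Numbers stated in the experimental setup.
num_known = 158            # known people (L)
train_per_known = 5        # training images per known person
num_background = 210       # background people, one image each

train_size = num_known * train_per_known + num_background
train_people = num_known + num_background

# For one person specific model: own images are positives, all others negatives.
positives = train_per_known
negatives = train_size - positives

# Gabor feature dimension: 64 x 64 pixels, 5 scales x 8 orientations.
num_filters = 5 * 8
feature_dim = 64 * 64 * num_filters

print(train_size, train_people, positives, negatives, feature_dim)
```

The resulting positive-to-negative ratio is 5 : 995, i.e.
1 : 199, and the feature dimension exceeds the training set
size by more than two orders of magnitude, which is the
high-dimensional, small sample size regime discussed in the
introduction.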

We apply both the single-task and the multi-task feature
selection approach to select the 300 most informative features
from the original 163,840-dimensional Gabor features. For
runtime reasons, OMP and SOMP are adopted to solve the
single-task and multi-task feature selection problems,
respectively. The outputs of OMP and SOMP include both the
indices of the selected features and the corresponding weights,
and can therefore be used for verification directly. We also
use ridge regression to re-estimate the weights of the selected
features. The corresponding verification methods are denoted as
"STL" and "MTL" (direct use) and "STL+R" and "MTL+R" (with
ridge regression). In addition, we adopt the Adaboost-based
method as the baseline for feature selection and verification.
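
The multi-task pipeline with a ridge regression refit can be
sketched as below. This is a minimal illustration under stated
assumptions, not the code used for the reported results: the
function names `somp` and `ridge_refit`, the ridge parameter
`alpha`, the centering used to absorb the biases, and the
column-norm normalization are all choices made for this example.

```python
import numpy as np

def somp(X, Y, k):
    """Greedy selection of k columns of X shared by all columns of Y."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    norms = np.linalg.norm(Xc, axis=0)
    norms[norms == 0] = 1.0
    residual = Yc.copy()
    chosen = np.zeros(X.shape[1], dtype=bool)
    selected = []
    for _ in range(k):
        # Score each feature by its total absolute correlation over all tasks.
        score = np.sum(np.abs(Xc.T @ residual), axis=1) / norms
        score[chosen] = -np.inf
        j = int(np.argmax(score))
        chosen[j] = True
        selected.append(j)
        # Joint least squares refit of all tasks on the selected features.
        coef, *_ = np.linalg.lstsq(Xc[:, selected], Yc, rcond=None)
        residual = Yc - Xc[:, selected] @ coef
    return selected

def ridge_refit(X, Y, selected, alpha=1.0):
    """Ridge regression of every task on the selected features only."""
    Xs = X[:, selected]
    mean_x, mean_y = Xs.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = Xs - mean_x, Y - mean_y
    gram = Xc.T @ Xc + alpha * np.eye(len(selected))
    W = np.linalg.solve(gram, Xc.T @ Yc)
    b = mean_y - mean_x @ W
    return W, b

# Toy problem: 4 tasks that all depend on features 2, 11 and 20.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 30))
C_true = np.zeros((30, 4))
C_true[[2, 11, 20]] = [[1.0, -2.0, 1.5, 1.0],
                       [2.0, 1.0, -1.0, 1.5],
                       [-1.5, 1.0, 2.0, -1.0]]
Y = X @ C_true + np.array([0.1, -0.2, 0.3, 0.0])
selected = somp(X, Y, k=3)
W, b = ridge_refit(X, Y, selected, alpha=1e-8)
scores = X[:, selected] @ W + b     # verification scores, one column per task
```

Scoring a feature by its summed correlation over all tasks is
what ties the person specific models together: a feature is
admitted only if it helps on aggregate, and every model then
receives its own weight for it.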

All of these methods classify the training set perfectly, but
perform very differently on the testing set. We adopt the
average ROC curves and the average area under the ROC curve
(AUC) to evaluate their performance across different persons.
The comparative performance is shown in Fig. 1 and Table 1.
The Adaboost-based method may suffer from the imbalance of the
training set and performs much worse than the
regularization-based methods. The proposed multi-task feature
selection methods ("MTL" and "MTL+R") perform better than the
corresponding single-task feature selection methods ("STL" and
"STL+R"). Another observation is that the ridge
regression-based verification marginally improves the
performance compared with directly using the feature selection
framework for verification. This can be attributed to the fact
that the sparsity enforcement in the feature selection
framework may underestimate the resulting coefficients and
thereby yield worse performance.

Fig. 1. Average ROC curves of verifying images of 158 known
people using only 300 Gabor features (horizontal axis: false
positive rate).

Table 1. The average true positive rates (TPR) of the different
methods when the false positive rate (FPR) is fixed at 0.1, and
the average AUC.

Methods     TPR (std. dev.)       AUC (std. dev.)
STL         0.8046 (± 0.1600)     0.8506 (± 0.1255)
MTL         0.8465 (± 0.1458)     0.8901 (± 0.0969)
STL+R       0.8185 (± 0.1636)     0.9444 (± 0.1458)
MTL+R       0.8525 (± 0.1480)     0.9586 (± 0.0288)
Adaboost    0.3112 (± 0.1708)     0.6811 (± 0.1066)
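
The two evaluation measures can be computed per person from
verification scores as in the following sketch and then
averaged over all persons. The function names and the toy
scores are hypothetical, and the sketch assumes that there are
no tied scores.

```python
import numpy as np

def roc_points(scores, labels):
    """ROC curve from scores (higher means accept) and 0/1 labels."""
    order = np.argsort(-np.asarray(scores), kind="stable")
    labels = np.asarray(labels)[order]
    tps = np.cumsum(labels == 1)
    fps = np.cumsum(labels == 0)
    tpr = np.concatenate(([0.0], tps / tps[-1]))
    fpr = np.concatenate(([0.0], fps / fps[-1]))
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

def tpr_at_fpr(fpr, tpr, target=0.1):
    """Best true positive rate reachable without exceeding the target FPR."""
    return float(np.max(tpr[fpr <= target + 1e-12]))

# Toy person: 3 genuine test images and 10 impostor images.
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50,
          0.40, 0.30, 0.20, 0.15, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

fpr, tpr = roc_points(scores, labels)
person_auc = auc(fpr, tpr)
person_tpr = tpr_at_fpr(fpr, tpr, target=0.1)
```

For this toy person, 29 of the 30 genuine/impostor pairs are
ranked correctly, so the AUC is 29/30, and all three genuine
images are accepted once one impostor in ten is allowed.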

6. CONCLUSIONS 

We have proposed a multi-task learning method for building
person specific models for both feature selection and face
verification. The person specific models are jointly learned
by sharing the training data, and the multi-task feature
selection problem is reformulated as a simultaneous sparse
approximation problem, which can be solved by greedy algorithms
such as SOMP or by related convex relaxation methods. The
experimental results show that the proposed multi-task feature
selection method can overcome the potential overfitting issues
due to the lack of training data, and that the adoption of
ridge regression for verification can marginally improve the
performance.